Method and system for identifying image relatedness using link and page layout analysis

ABSTRACT

A method and system for determining relatedness of images of pages based on link and page layout analysis. A link analysis system determines relatedness between images by first identifying blocks within web pages, and then analyzing the importance of the blocks to web pages, web pages to blocks, and images to blocks. Based on this analysis, the link analysis system determines the degree to which each image is related to each other image. The link analysis system may also use the relatedness of images to generate a ranking of the images. The link analysis system may also generate a vector representation of the images based on their relatedness and apply a clustering algorithm to the vector representations to identify clusters of related images.

TECHNICAL FIELD

The described technology relates generally to analyzing web pages and particularly to relatedness of images of web pages.

BACKGROUND

Many search engine services, such as Google and Overture, provide for searching for information that is accessible via the Internet. These search engine services allow users to search for display pages, such as web pages, that may be of interest to users. After a user submits a search request that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling and indexing” the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service then ranks the web pages of the search result based on the closeness of each match, web page popularity (e.g., Google's PageRank), and so on. The search engine service may also generate a relevance score to indicate how relevant the information of the web page may be to the search request. The search engine service then displays to the user links to those web pages in an order that is based on their rankings.

Although many web pages are graphically oriented in that they may contain many images, conventional search engine services typically search based on only the textual content of a web page. Some attempts have been made, however, to support image-based searching of web pages. For example, a user viewing a web page may want to identify other web pages that contain images related to an image on that web page. The image-based search techniques are typically either content-based or link-based and additionally use surrounding text to aid in analyzing images. The content-based techniques use low-level visual information for image indexing. Because the content-based search techniques are very computationally expensive, they are not practical for image searching on the web.

The link-based search techniques typically assume that images on the same web page are likely to be related and that images on web pages that are each linked to by the same web page are related. Unfortunately, these assumptions are incorrect in many situations primarily because a single web page may have content relating to many different topics. For example, a web page for a news web site may contain content relating to an international political event and content relating to a national sporting event. In such a case, it is unlikely that a picture of a sports team relating to the national sporting event is related to a web page linked to by the content relating to the international political event.

It would be desirable to have an image-based search technique that would not be as computationally expensive as conventional content-based search techniques and that, unlike conventional link-based search techniques, would account for the diverse topics that can occur on a single web page.

SUMMARY

A system for determining relatedness of images of pages based on link and page layout analysis is provided. A link analysis system determines relatedness between images by first identifying blocks within pages, and then analyzing the importance of the blocks to pages, pages to blocks, and images to blocks. Based on this analysis, the link analysis system determines the degree to which each image is related to each other image. Because the relatedness of an image to another image is based on block-level importance, which is a smaller unit than a page, rather than page-level importance, this relatedness is a more accurate representation of relatedness than conventional link-based search techniques.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating blocks, images, and links in a sample collection of web pages.

FIG. 2 is a block diagram illustrating components of the link analysis system in one embodiment.

FIG. 3 is a flow diagram that illustrates processing of a generate image-to-image matrix component in one embodiment.

FIG. 4 is a flow diagram that illustrates the processing of a generate block-to-page matrix component in one embodiment.

FIG. 5 is a flow diagram that illustrates the processing of a generate page-to-block matrix component in one embodiment.

FIG. 6 is a flow diagram that illustrates the processing of a generate block-to-image matrix component in one embodiment.

DETAILED DESCRIPTION

A method and system for determining relatedness of images of pages based on link and page layout analysis is provided. In one embodiment, a link analysis system determines relatedness between images by first identifying blocks within web pages, and then analyzing the importance of the blocks to web pages, web pages to blocks, and images to blocks. Based on this analysis, the link analysis system determines the degree to which each image is related to each other image. A block of a web page represents an area of the web page that appears to relate to a similar topic. For example, a news article relating to an international political event may represent one block, and a news article relating to a national sporting event may represent another block. The importance of a block to a page may indicate a probability that a user will focus on that block when viewing that page. The importance of a page to a block may indicate the probability that a user will select from that block a link to that page. The importance of an image to a block may indicate the probability that a user will focus on that image when viewing that block. After calculating a numeric indicator of these importances for pairs of pages and blocks and pairs of images and blocks, the link analysis system generates an indicator of the relatedness of each image to each other image by combining the calculated importance of a block to a page, the calculated importance of a page to a block, and the calculated importance of an image to a block. Because the relatedness of an image to another image is based on block-level importance rather than on page-level importance, this relatedness is a more accurate representation of relatedness than conventional link-based search techniques.

The link analysis system may also use the relatedness of images to generate a ranking of the images. The ranking may be based on a probability that a user who starts viewing an arbitrary image will transition to another image after an arbitrarily large number of transitions between images. The link analysis system may also generate a vector representation of the images based on their relatedness and apply a clustering algorithm to the vector representations to identify clusters of related images.

FIG. 1 is a block diagram illustrating blocks, images, and links in a sample collection of web pages. This collection of web pages includes web pages 1-4. The blocks within the web pages are represented as rectangles, the images within blocks are represented as circles, and the links within blocks are represented as directed arrows from a block to a linked-to web page. Web page 1 contains block 1, which contains images 1 and 2 and links 1 and 2. Web page 2 contains block 2, which contains image 3 and link 3, and block 3, which contains image 4 and link 4. Web page 3 contains block 4, which contains image 5 and links 5 and 6, and block 5, which contains image 6 and link 7. Web page 4 contains block 6, which contains images 7, 8, 9, and 10 and link 8. Because the link analysis system bases image relatedness on blocks rather than entire web pages, the relatedness of an image to other images is likely based on a more accurate representation of the topic of an image. For example, web page 2 contains blocks 2 and 3, which may be directed to different topics such as an international political event and a national sporting event, respectively. The link analysis system may identify that image 4 is more closely related to the images of web page 4 than to the images of web page 3, because block 3, which contains image 4, has a link 4 to web page 4. For example, web page 4 is more likely sports-related than is web page 3 because block 3 contains a link to web page 4, but not to web page 3. As such, image 4 is more likely related to images 7, 8, 9, and 10 than to images 5 and 6 of web page 3. Techniques that are not based on block-level analysis may identify that image 4 is equally related to web page 3 and web page 4 because those techniques do not distinguish block 2 from block 3 on web page 2.

In one embodiment, the link analysis system calculates the importance of a page to a block, for each block and page combination, as the probability that a user who selects a link of that block will select a link to that page. If a block does not have a link to a page, then the probability is zero. If a block has a link to a page, then the link analysis system may assume a user will select each of the links of the block with equal probability. A block-to-page matrix of probabilities is defined by the following equation:

$Z_{ij} = \begin{cases} 1/s_i & \text{if there is a link from block } i \text{ to page } j \\ 0 & \text{otherwise} \end{cases} \qquad (1)$

where $Z_{ij}$ represents the probability that a user who selects a link of block $i$ will select the link to page $j$ and $s_i$ is the number of links in block $i$. The block-to-page matrix $Z$ for the web pages of FIG. 1 is shown in Table 1. The rows of Table 1 represent the blocks and the columns represent the pages. In this example, the probability that a user who selects a link of block 4 will select a link to web page 2 is 0.5.

TABLE 1
         1     2     3     4
   1     0    .5    .5     0
   2     0     0     1     0
   3     0     0     0     1
   4     0    .5     0    .5
   5     1     0     0     0
   6     0     0     1     0
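As an illustration only, Equation 1 can be sketched in a few lines of Python. The dictionary `BLOCK_LINKS` and the function name `block_to_page_matrix` are hypothetical names introduced for this sketch, and the link targets are an assumption inferred from Tables 1 and 4 rather than read directly from FIG. 1.

```python
# Hypothetical encoding of FIG. 1: block number -> pages targeted by the block's links.
# The targets are inferred from Tables 1 and 4 and are an assumption of this sketch.
BLOCK_LINKS = {1: [2, 3], 2: [3], 3: [4], 4: [2, 4], 5: [1], 6: [3]}
NUM_PAGES = 4


def block_to_page_matrix(block_links, num_pages):
    """Build Z (Equation 1): one row per block, one column per page."""
    z = []
    for block in sorted(block_links):
        targets = block_links[block]
        row = [0.0] * num_pages
        for page in targets:
            # Each of the s_i links of block i is assumed equally likely.
            row[page - 1] = 1.0 / len(targets)
        z.append(row)
    return z


Z = block_to_page_matrix(BLOCK_LINKS, NUM_PAGES)
for block, row in enumerate(Z, start=1):
    print(block, row)
```

The row printed for block 4 places a probability of 0.5 on page 2, which agrees with the example given for Table 1.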

In one embodiment, the link analysis system calculates, for each page and block combination, the importance of a block to a page as the probability of that block being the most important block of the page. The probability of a block not contained on a page being the most important block of that page is zero. The link analysis system may assume that each block contained on a page is most important with equal probability. A page-to-block matrix of probabilities is defined by the following equation:

$X_{ij} = \begin{cases} 1/s_i & \text{if page } i \text{ contains block } j \\ 0 & \text{otherwise} \end{cases} \qquad (2)$

where $X_{ij}$ represents the probability that block $j$ is the most important block of page $i$ and $s_i$ is the number of blocks on page $i$.

In one embodiment, the link analysis system calculates a probability that a block is the most important block of a page based on position, size, font, color, and other physical attributes of the block. For example, a large block that is centered in the middle of a page may be more important than a small block in the lower left corner of the page. A technique for calculating block importance and the degree of coherency of blocks is described in U.S. patent application Ser. No. ______, entitled, “Method and System for Calculating Importance of a Block Within a Display Page” and filed on Apr. 29, 2004, which is hereby incorporated by reference. The page-to-block matrix $X$ may be more generally represented as:

$X_{ij} = \begin{cases} f_{p_i}(b_j) & \text{if page } i \text{ contains block } j \\ 0 & \text{otherwise} \end{cases} \qquad (3)$

where $f_{p_i}$ is a function representing the probability that block $j$ is the most important block of page $i$. In one embodiment, the function $f_{p_i}$ is defined as the size of block $j$ divided by the distance of the center of the block from the center of the screen when page $i$ is displayed. The function $f$ may be defined by the following:

$f_{p_i}(b) = \alpha \, \dfrac{\text{size of block } b \text{ in page } p_i}{\text{distance from the center of } b \text{ to the center of the screen}} \qquad (4)$

where $\alpha$ is a normalization factor that ensures that the values of the function for the blocks of a page sum to 1. The function $f$ can be considered to be the probability that a user is focused on block $j$ when viewing page $i$. The page-to-block matrix $X$ for the web pages of FIG. 1 is shown in Table 2. The rows of Table 2 represent the pages and the columns represent the blocks. In this example, the probability that block 4 is the most important block of web page 3 is 0.8.

TABLE 2
         1     2     3     4     5     6
   1     1     0     0     0     0     0
   2     0    .5    .5     0     0     0
   3     0     0     0    .8    .2     0
   4     0     0     0     0     0     1
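A minimal sketch of the size-over-distance measure of Equation 4 follows. The function name `block_importance` and the block geometry are hypothetical; the sizes and distances were chosen only so that the result reproduces the values 0.8 and 0.2 shown for web page 3 in Table 2, and the sketch assumes every distance is greater than zero.

```python
def block_importance(blocks):
    """Evaluate Equation 4 for every block of one page.

    `blocks` maps a block number to a (size, distance) pair, where distance is
    measured from the block's center to the center of the screen and must be > 0.
    """
    raw = {b: size / distance for b, (size, distance) in blocks.items()}
    alpha = 1.0 / sum(raw.values())  # normalization factor of Equation 4
    return {b: alpha * value for b, value in raw.items()}


# Hypothetical geometry for web page 3, chosen so the result matches Table 2.
page_3_blocks = {4: (400.0, 100.0), 5: (100.0, 100.0)}
row = block_importance(page_3_blocks)
print(row)
```

Because the normalization is applied per page, the returned values for one page always sum to 1 and can be written directly into the corresponding row of the page-to-block matrix.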

In one embodiment, the link analysis system calculates, for each block and image combination, the importance of an image to a block as the probability of that image being the most important image of that block. If a block does not contain a certain image, then the probability of that image being the most important of that block is zero. The link analysis system may assume that each image of a block is most important with equal probability. The link analysis system could use other measures of importance of an image to a block, such as based on the relative sizes of the images, the location of the images within the blocks, and so on. A block-to-image matrix of the probabilities is defined by the following equation:

$Y_{ij} = \begin{cases} 1/s_i & \text{if block } i \text{ contains image } j \\ 0 & \text{otherwise} \end{cases} \qquad (5)$

where $Y_{ij}$ represents the probability that image $j$ is the most important image of block $i$ and $s_i$ is the number of images in block $i$. The block-to-image matrix $Y$ for the web pages of FIG. 1 is shown in Table 3. The rows of Table 3 represent blocks and the columns represent the images. In this example, the probability that image 2 is the most important image of block 1 is 0.5.

TABLE 3
         1     2     3     4     5     6     7     8     9    10
   1    .5    .5     0     0     0     0     0     0     0     0
   2     0     0     1     0     0     0     0     0     0     0
   3     0     0     0     1     0     0     0     0     0     0
   4     0     0     0     0     1     0     0     0     0     0
   5     0     0     0     0     0     1     0     0     0     0
   6     0     0     0     0     0     0   .25   .25   .25   .25
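The following sketch shows how one row of the block-to-image matrix might be filled in, both with the equal-probability assumption of Equation 5 and with a size-based alternative of the kind mentioned above. The function name `block_to_image_row`, the `sizes` parameter, and the pixel areas are hypothetical and serve only as an illustration.

```python
def block_to_image_row(images, num_images, sizes=None):
    """Build one row of Y for a block.

    `images` lists the image numbers in the block. With `sizes` omitted the
    images are equally important (Equation 5); otherwise `sizes` maps an image
    number to its area and importance is proportional to area (an assumption).
    """
    row = [0.0] * num_images
    if sizes is None:
        weights = {image: 1.0 for image in images}
    else:
        weights = {image: float(sizes[image]) for image in images}
    total = sum(weights.values())
    for image, weight in weights.items():
        row[image - 1] = weight / total
    return row


# Block 6 (the last row of Table 3) contains images 7, 8, 9, and 10.
equal = block_to_image_row([7, 8, 9, 10], 10)
# Hypothetical pixel areas, used only to illustrate size-based importance.
by_size = block_to_image_row([7, 8, 9, 10], 10, sizes={7: 300, 8: 100, 9: 50, 10: 50})
print(equal)
print(by_size)
```

Either variant keeps each row of the matrix summing to 1, so the later matrix products can be used unchanged whichever measure of image importance is selected.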

In one embodiment, the link analysis system calculates the importance of one page to another page, for each ordered pair of pages, as the probability that a user viewing the first page of the pair will select a link to the second page of the pair. The link analysis system calculates the probability for each pair by summing, for each block of the first page, the probability of that block being the most important block of the first page times the probability that a user who selects a link of that block will select a link to the second page. The importance of a page to another page thus factors in that users may prefer to select links within the most important blocks of a page. A page-to-page matrix of these probabilities is represented by the following:

$W_p = XZ \qquad (6)$

where $W_p$ represents the page-to-page matrix. The probabilities of $W_p$ can alternatively be represented as:

$\mathrm{Prob}(\beta \mid \alpha) = \sum_{b \in \alpha} \mathrm{Prob}(\beta \mid b) \, \mathrm{Prob}(b \mid \alpha) \qquad (7)$

where $\alpha$ represents the first page of the pair and $\beta$ represents the second page of the pair. The page-to-page matrix $W_p$ for the web pages of FIG. 1 is shown in Table 4. In this example, the probability that a user viewing page 3 will transition to page 2 is 0.4.

TABLE 4
         1     2     3     4
   1     0    .5    .5     0
   2     0     0    .5    .5
   3    .2    .4     0    .4
   4     0     0     1     0
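As a worked check of Equation 6, the product of the page-to-block and block-to-page matrices can be computed directly. The helper `matmul` is a hypothetical name, and the placement of the nonzero entries of `X` and `Z` is an assumption inferred from Tables 1, 2, and 4.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# X: pages x blocks (Table 2). Z: blocks x pages (Table 1).
X = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
Z = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

W_p = matmul(X, Z)
print(W_p[2])  # transition probabilities out of page 3
```

The entry for page 3 to page 2 comes from block 4 alone: 0.8 (importance of block 4 to page 3) times 0.5 (probability of following the link from block 4 to page 2), which gives the 0.4 cited in the example.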

The link analysis system calculates, for each ordered pair of blocks, the importance of one block to another block as the probability that a user viewing the first block of the pair will select a link to the page containing the second block of the pair and will find that second block to be the most important of its page. The link analysis system calculates the probability for each pair by summing the probabilities that a user who selects a link of the first block will select a link for the page that contains the second block times the probability of that second block being the most important block of its page. Thus, the importance of one block to another block represents that a user viewing the first block will select a link to the page containing the second block and focus their attention on the second block. A block-to-block matrix of these probabilities is represented by the following:

$W_B = ZX \qquad (8)$

where $W_B$ represents the block-to-block matrix. The probabilities of $W_B$ can alternatively be represented as:

$W_B(a,b) = \mathrm{Prob}(b \mid a) = \sum_{\gamma \in P} \mathrm{Prob}(\gamma \mid a) \, \mathrm{Prob}(b \mid \gamma) = \mathrm{Prob}(\beta \mid a) \, \mathrm{Prob}(b \mid \beta) = Z(a,\beta) \, X(\beta,b), \quad a, b \in B \qquad (9)$

where $P$ is the set of pages, $B$ is the set of blocks, and $\beta$ is the page that contains block $b$. The sum collapses to a single term because $\mathrm{Prob}(b \mid \gamma)$ is zero for every page $\gamma$ that does not contain block $b$.

The block-to-block matrix $W_B$ for the web pages of FIG. 1 is shown in Table 5. In this example, the probability that a user viewing block 4 will jump to page 2 and focus their attention on block 3 is 0.25.

TABLE 5
         1     2     3     4     5     6
   1     0   .25   .25    .4    .1     0
   2     0     0     0    .8    .2     0
   3     0     0     0     0     0     1
   4     0   .25   .25     0     0    .5
   5     1     0     0     0     0     0
   6     0     0     0    .8    .2     0
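The single-term form of Equation 9 can be sketched without a full matrix product, because each block belongs to exactly one page. The names `PAGE_OF_BLOCK` and `block_to_block` are hypothetical, and the sparse dictionaries below are an assumed encoding of Tables 1 and 2.

```python
# Hypothetical encoding of FIG. 1 using the values of Tables 1 and 2.
PAGE_OF_BLOCK = {1: 1, 2: 2, 3: 2, 4: 3, 5: 3, 6: 4}
# Z[a][page]: probability that a link selected in block a leads to that page.
Z = {1: {2: 0.5, 3: 0.5}, 2: {3: 1.0}, 3: {4: 1.0}, 4: {2: 0.5, 4: 0.5}, 5: {1: 1.0}, 6: {3: 1.0}}
# X[page][b]: probability that block b is the most important block of that page.
X = {1: {1: 1.0}, 2: {2: 0.5, 3: 0.5}, 3: {4: 0.8, 5: 0.2}, 4: {6: 1.0}}


def block_to_block(a, b):
    """Return W_B(a, b) = Z(a, beta) * X(beta, b), where beta is the page of block b."""
    beta = PAGE_OF_BLOCK[b]
    return Z[a].get(beta, 0.0) * X[beta].get(b, 0.0)


print(block_to_block(4, 3))
```

For blocks 4 and 3 the page of the target block is page 2, so the value is 0.5 times 0.5, matching the 0.25 given in the example.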

In one embodiment, the link analysis system factors into the block-to-block matrix the probability that two blocks on the same page may be related. The revised block-to-block matrix is represented by the following:

$W_B = (1 - t) ZX + t D U \qquad (10)$

where $D$ is a diagonal matrix with $D_{ii} = \sum_j U_{ij}$, $U$ is a coherence matrix, and $t$ is a weighting factor. The matrix $U$ is defined as follows:

$U_{ij} = \begin{cases} 0 & \text{if block } i \text{ and block } j \text{ are on different pages} \\ \mathrm{DOC} & \text{otherwise} \end{cases} \qquad (11)$

where DOC is the degree of coherency of the smallest block containing both block $i$ and block $j$. The weighting factor $t$ may typically be set to a small value (e.g., less than 0.1) because in most instances different blocks on the same page relate to different topics.

The link analysis system calculates for each ordered pair of images the probability that the first image of the pair is related to the second image of the pair. The link analysis system calculates the probability by summing the block-to-block probabilities for the combination of each block that contains the first image to each block that contains the second image. An image-to-image matrix of these probabilities is represented by the following:

$W_I = Y^T W_B Y \qquad (12)$

where $W_I$ represents the image-to-image matrix. The image-to-image matrix $W_I$ for the web pages of FIG. 1 is shown in Table 6. In this example, the probability that a user viewing image 10 will next view page 3 and focus on image 6 of block 5 is 0.05.

TABLE 6
         1     2     3     4     5     6     7     8     9    10
   1     0     0  .125  .125    .2   .05     0     0     0     0
   2     0     0  .125  .125    .2   .05     0     0     0     0
   3     0     0     0     0    .8    .2     0     0     0     0
   4     0     0     0     0     0     0   .25   .25   .25   .25
   5     0     0   .25   .25     0     0  .125  .125  .125  .125
   6    .5    .5     0     0     0     0     0     0     0     0
   7     0     0     0     0    .2   .05     0     0     0     0
   8     0     0     0     0    .2   .05     0     0     0     0
   9     0     0     0     0    .2   .05     0     0     0     0
  10     0     0     0     0    .2   .05     0     0     0     0
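The chain from Equations 1, 2, and 5 through Equations 8 and 12 can be sketched end to end as follows. The helper names `matmul` and `transpose` and the list `BLOCK_IMAGES` are hypothetical, and the matrix entries are an assumed encoding of FIG. 1 in which block 5 of web page 3 contains image 6 and block 6 of web page 4 contains images 7 through 10.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(column) for column in zip(*a)]


# Z: blocks x pages (Equation 1). X: pages x blocks (Equation 2 or 3).
Z = [[0, .5, .5, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, .5, 0, .5], [1, 0, 0, 0], [0, 0, 1, 0]]
X = [[1, 0, 0, 0, 0, 0], [0, .5, .5, 0, 0, 0], [0, 0, 0, .8, .2, 0], [0, 0, 0, 0, 0, 1]]
# Images contained in blocks 1..6; used to build Y (Equation 5).
BLOCK_IMAGES = [[1, 2], [3], [4], [5], [6], [7, 8, 9, 10]]
Y = [[1.0 / len(imgs) if i in imgs else 0.0 for i in range(1, 11)] for imgs in BLOCK_IMAGES]

W_B = matmul(Z, X)                            # Equation 8
W_I = matmul(matmul(transpose(Y), W_B), Y)    # Equation 12

# Relatedness of image 10 to image 6 (indices are zero-based).
print(round(W_I[9][5], 3))
```

The printed entry is the product of three factors: 0.25 for image 10 within block 6, 0.2 for reaching block 5 from block 6 by way of page 3, and 1 for image 6 within block 5.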

In one embodiment, the link analysis system factors into the image-to-image matrix the probability that two images in the same block may be related. The revised image-to-image matrix is represented by the following:

$W_I = t D Y^T Y + (1 - t) Y^T W_B Y \qquad (13)$

where $t$ is a weighting factor and $D$ is a diagonal matrix with

$D_{ii} = \sum_j (Y^T Y)_{ij} \qquad (14)$

The weighting factor $t$ may be set to a large value (e.g., 0.7-0.9) because two images in the same block are likely to be related.

In one embodiment, the link analysis system generates a vector representation of each image from the image-to-image matrix. The link analysis system generates the vectors using a least-squares approach that factors in the similarity between a pair of images as indicated by the image-to-image matrix. The link analysis system initially converts the image-to-image matrix to a similarity matrix represented by the following:

$S = (W_I + W_I^T)/2 \qquad (15)$

where $S$ represents the similarity matrix. If $y_i$ is a vector representation of image $i$, then the optimal set of image vectors $y = (y_1, \ldots, y_m)$ is obtained using the following objective function:

$\min_{y} \sum_{i,j} (y_i - y_j)^2 S_{ij} \qquad (16)$

If $D$ is a diagonal matrix such that $D_{ii}$ is the sum of the values of the $i$-th row of the similarity matrix $S$, then the minimization problem reduces to the following:

$\min_{y^T y = 1} y^T L y \qquad (17)$

where $L$ is equal to $D - S$. The solution is given by the minimum eigenvalue solution to the general eigenvalue problem:

$L y = \lambda y \qquad (18)$

If $(y^0, \lambda^0), (y^1, \lambda^1), \ldots, (y^{m-1}, \lambda^{m-1})$ are solutions to Equation 18, and $\lambda^0 < \lambda^1 < \ldots < \lambda^{m-1}$, then $\lambda^0 = 0$ and $y^0 = (1, 1, \ldots, 1)$. The link analysis system selects eigenvectors 1 through $k$ to represent the images in a $k$-dimensional Euclidean space. The vector for an image is represented as follows:

image $j \leftarrow (y^1(j), \ldots, y^k(j)) \qquad (19)$

where $y^i(j)$ denotes the $j$-th element of $y^i$.
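The step from the pairwise objective to the quadratic form in $L = D - S$ rests on the identity $\sum_{i,j} (y_i - y_j)^2 S_{ij} = 2\, y^T L y$ for symmetric $S$, and on the fact that every row of $L$ sums to zero, which is why the constant vector is an eigenvector with eigenvalue 0. The sketch below checks both numerically. The function names and the 3-by-3 relatedness matrix are hypothetical and are not taken from the tables; computing the remaining eigenvectors would require an eigensolver, which is outside this sketch.

```python
def similarity(w):
    """Symmetrize a relatedness matrix: S = (W + W^T) / 2."""
    n = len(w)
    return [[(w[i][j] + w[j][i]) / 2.0 for j in range(n)] for i in range(n)]


def laplacian(s):
    """Return L = D - S, where D_ii is the sum of row i of S."""
    n = len(s)
    return [[(sum(s[i]) if i == j else 0.0) - s[i][j] for j in range(n)] for i in range(n)]


def objective(y, s):
    """Sum over all pairs (i, j) of (y_i - y_j)^2 * S_ij."""
    n = len(s)
    return sum((y[i] - y[j]) ** 2 * s[i][j] for i in range(n) for j in range(n))


def quadratic_form(y, l):
    """Return y^T L y."""
    n = len(l)
    return sum(y[i] * l[i][j] * y[j] for i in range(n) for j in range(n))


# Hypothetical 3-image relatedness matrix, not taken from the tables.
W_I = [[0.0, 0.6, 0.1], [0.2, 0.0, 0.4], [0.3, 0.0, 0.0]]
S = similarity(W_I)
L = laplacian(S)
y = [0.5, -0.2, 0.1]  # arbitrary one-dimensional embedding

print(objective(y, S), 2.0 * quadratic_form(y, L))  # the two values agree
ones = [1.0, 1.0, 1.0]
print([sum(L[i][j] * ones[j] for j in range(3)) for i in range(3)])  # approximately zero
```

Because the constant vector carries no information about which images are similar, it is discarded and the eigenvectors that follow it supply the coordinates of each image.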

The link analysis system identifies clusters of related images by representing each image by a vector such that the distance between the image vectors represents their semantic similarity. Various clustering algorithms may be applied to the image vectors to identify clusters of semantically related images. These clustering algorithms may include a Fiedler vector from spectral graph theory, a k-means clustering, and so on.

The clustering of images can be used to assist in browsing. For example, when browsing to a web page, a user can select an image and request to see related images. The web pages that contain the images that are clustered together with the selected image can then be presented as the result of the request. In one embodiment, the web pages can be presented in an order that is based on the distance between the image vector of each image and the image vector of the selected image.

The clustering of images can also be used to provide a multidimensional visualization of images that are semantically related. The image vectors can be generated for the images of a collection of web pages. Once the clusters are identified, the system can display an indication of each cluster on a two-dimensional grid representing clusters based on different eigenvectors.

The link analysis system can rank images based on the image-to-image matrix. The image-to-image matrix represents the probability of transitioning from image to image. It is possible that a user will transition to an image randomly. To account for this, the link analysis system generates a probability transition matrix that factors this randomness into the image-to-image matrix as follows:

$P = \varepsilon W + (1 - \varepsilon) U \qquad (20)$

where $P$ is a probability transition matrix, $W$ is the image-to-image matrix, $\varepsilon$ is a weighting factor (e.g., 0.1 to 0.2), and $U$ is a transition matrix of uniform transition probabilities ($U_{ij} = 1/m$ for all $i$, $j$). Because of the introduction of $U$, the graph is connected and a stationary distribution of a random walk of the graph exists. The rank of an image can be represented as follows:

$P^T \pi = \pi \qquad (21)$

where $\pi$ is an eigenvector of $P^T$ with eigenvalue 1 representing the image rank. $\pi = (\pi_1, \ldots, \pi_m)$ represents a stationary probability distribution and $\pi_i$ represents the rank of image $i$.
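One common way to approximate the stationary distribution of Equation 21 is power iteration, sketched below. The function names are hypothetical, and the choice of power iteration is an assumption of this sketch rather than a technique named in the description. The sketch also assumes that the rows of $W$ sum to 1, so that $P$ is itself a valid transition matrix; for that reason it uses the page-to-page matrix of Table 4, whose rows already sum to 1, rather than an image-to-image matrix.

```python
def transition_matrix(w, eps):
    """Equation 20 as written: P = eps * W + (1 - eps) * U, with U_ij = 1/m."""
    m = len(w)
    return [[eps * w[i][j] + (1.0 - eps) / m for j in range(m)] for i in range(m)]


def stationary_distribution(p, iterations=100):
    """Approximate pi satisfying P^T pi = pi (Equation 21) by power iteration."""
    m = len(p)
    pi = [1.0 / m] * m
    for _ in range(iterations):
        # One step of pi <- P^T pi.
        pi = [sum(p[i][j] * pi[i] for i in range(m)) for j in range(m)]
    return pi


# Page-to-page matrix of Table 4 (row-stochastic), used here as W.
W = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.2, 0.4, 0.0, 0.4],
    [0.0, 0.0, 1.0, 0.0],
]
P = transition_matrix(W, 0.15)
rank = stationary_distribution(P)
print(rank)
```

In this example page 3 receives the largest rank value, which is consistent with it being the page most often linked to in Table 4.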

FIG. 2 is a block diagram illustrating components of the link analysis system in one embodiment. The link analysis system 200 includes a web page store 201, a calculate image rank component 202, an identify image clusters component 203, and a generate image-to-image matrix component 211. The generate image-to-image matrix component 211 uses an identify blocks component 212, a generate block-to-page matrix component 213, a generate page-to-block matrix component 214, and a generate block-to-image matrix component 215 to generate a matrix that indicates the image-to-image relatedness. The web page store contains the collection of web pages. The calculate image rank component uses the generate image-to-image matrix component to calculate the relatedness of the images and then uses those calculations of relatedness to rank the images. The identify image clusters component uses the generate image-to-image matrix component to calculate the relatedness of the images, generates a vector representation of the images based on the matrix, and identifies clusters of images using the generated vectors. Although not shown in FIG. 2, the link analysis system may also include a component to calculate ranking elements of a web page other than the images. For example, the link analysis system may apply the rankings of Equations 20 and 21 to the block-to-block matrix to rank the blocks and to the page-to-page matrix to rank the pages themselves.

The computing device on which the link analysis system is implemented may include a central processing unit, memory, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), and storage devices (e.g., disk drives). The memory and storage devices are computer-readable media that may contain instructions that implement the link analysis system. In addition, the data structures and message structures may be stored or transmitted via a data transmission medium, such as a signal on a communications link. Various communications links may be used, such as the Internet, a local area network, a wide area network, or a point-to-point dial-up connection.

FIG. 2 illustrates an example of a suitable operating environment in which the link analysis system may be implemented. The operating environment is only one example of a suitable operating environment and is not intended to suggest any limitation as to the scope of use or functionality of the link analysis system. Other well-known computing systems, environments, and configurations that may be suitable for use include personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

The link analysis system may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.

FIG. 3 is a flow diagram that illustrates processing of a generate image-to-image matrix component in one embodiment. In block 301, the component identifies the blocks within the web pages stored in the web page store. In block 302, the component invokes the generate block-to-page matrix component. In block 303, the component invokes the generate page-to-block matrix component. In block 304, the component invokes the generate block-to-image matrix component. In block 305, the component generates the block-to-block matrix. In block 306, the component generates the image-to-image matrix and then completes.

FIG. 4 is a flow diagram that illustrates the processing of a generate block-to-page matrix component in one embodiment. In blocks 401-408, the component loops selecting each page, each block within each page, and each link within each block and setting the importance, to that block, of the page linked to by that link. In block 401, the component selects the next page. In decision block 402, if all the pages have already been selected, then the component returns the block-to-page matrix, else the component continues at block 403. In block 403, the component selects the next block of the selected page. In decision block 404, if all the blocks of the selected page have already been selected, then the component loops to block 401 to select the next page, else the component continues at block 405. In block 405, the component counts the number of links within the selected block. In block 406, the component selects the linked-to page of the next link of the selected block. In decision block 407, if all the linked-to pages of the selected block have already been selected, then the component loops to block 403 to select the next block, else the component continues at block 408. In block 408, the component sets the importance of the linked-to page to the block and then loops to block 406 to select the linked-to page of the next link of the selected block.

FIG. 5 is a flow diagram that illustrates the processing of a generate page-to-block matrix component in one embodiment. In blocks 501-506, the component loops selecting each page and each block within each page and setting the importance of that block to the selected page. In block 501, the component selects the next page of the web page store. In decision block 502, if all the pages have already been selected, then the component returns the page-to-block matrix, else the component continues at block 503. In block 503, the component selects the next block of the selected page. In decision block 504, if all the blocks of the selected page have already been selected, then the component loops to block 501 to select the next page, else the component continues at block 505. In block 505, the component calculates the importance of the selected block to the selected page. In block 506, the component sets the importance of the selected block to the selected page and then loops to block 503 to select the next block of the selected page.

FIG. 6 is a flow diagram that illustrates the processing of a generate block-to-image matrix component in one embodiment. In blocks 601-608, the component loops selecting each page, each block within each page, and each image within each block and setting the importance of the image to the selected block. In block 601, the component selects the next page of the web page store. In decision block 602, if all the pages have already been selected, then the component returns the block-to-image matrix, else the component continues at block 603. In block 603, the component selects the next block of the selected page. In decision block 604, if all the blocks of the selected page have already been selected, then the component loops to block 601 to select the next page, else the component continues at block 605. In block 605, the component counts the number of images of the selected block. In block 606, the component selects the next image of the selected block. In decision block 607, if all the images of the selected block have already been selected, then the component loops to block 603 to select the next block, else the component continues at block 608. In block 608, the component sets the importance of the selected image to the selected block and then loops to block 606 to select the next image of the selected block.

One skilled in the art will appreciate that although specific embodiments of the link analysis system have been described herein for purposes of illustration, various modifications may be made without deviating from the spirit and scope of the invention. Accordingly, the invention is not limited except by the appended claims.

1. A method in a computer system for determining relatedness between images within blocks of pages, the method comprising: calculating indicators of importance of a block to a page; calculating indicators of importance of a page to a block; calculating indicators of importance of an image to a block; and calculating image-to-image indicators of relatedness of an image to another image by combining the indicators of importance of a block to a page, the indicators of importance of a page to a block, and the indicators of importance of an image to a block.
2. The method of claim 1 wherein the indicators of importance of a page to a block are probabilities that a user will select a link from each block that will lead to each other page.
3. The method of claim 1 wherein the indicators of importance of a block to a page are probabilities that a user will focus on each block of the page.
4. The method of claim 1 wherein the indicators of importance of an image to a block are probabilities that a user will focus on each image of each block.
5. The method of claim 1 wherein the indicators of importance of a page to a block are probabilities that a user will select a link from each block that will lead to each other page, the indicators of importance of a block to a page are probabilities that a user will focus on each block of the page, and the indicators of importance of an image to a block are probabilities that a user will focus on each image of each block.
6. The method of claim 1 including calculating a rank of the images from the image-to-image indicators.
7. The method of claim 6 wherein the calculated rank is based on a probability that a user starting at an arbitrary image will transition to another image after an arbitrarily large number of transitions between images.
8. The method of claim 1 wherein the image-to-image indicators are calculated as follows: $W_I = Y^T W_B Y$, where $W_I$ is a matrix of the image-to-image indicators, $Y$ is a matrix of image-to-block indicators, and $W_B = ZX$, where $W_B$ is a matrix of block-to-block indicators, $Z$ is a matrix of the indicators of importance of a page to a block, and $X$ is a matrix of the indicators of importance of a block to a page.
9. The method of claim 1 including: generating a vector representation of each image based on the image-to-image indicators; and identifying clusters of images based on their vector representations wherein images in a cluster are related.
10. A method in a computer system for determining relatedness between blocks of pages, the method comprising: calculating indicators of importance of a page to a block; calculating indicators of importance of a block to a page; and calculating block-to-block indicators of relatedness of one block to another block by combining the indicators of importance of a block to a page and the indicators of importance of a page to a block.
11. The method of claim 10 wherein the indicators of importance of a page to a block are probabilities that a user will select a link from each block that will lead to each other page.
12. The method of claim 10 wherein the indicators of importance of a block to a page are probabilities that a user will focus on each block of the page.
13. The method of claim 10 wherein the indicators of importance of a page to a block are probabilities that a user will select a link from each block that will lead to each other page and the indicators of importance of a block to a page are probabilities that a user will focus on each block of the page.
14. The method of claim 10 including calculating a rank of the blocks from the block-to-block indicators.
15. The method of claim 14 wherein the calculated rank is based on a probability that a user starting at an arbitrary block will transition to another block after an arbitrarily large number of transitions between blocks.
16. The method of claim 10 wherein the block-to-block indicators are calculated as follows: $W_B = ZX$, where $X$ is a matrix of the indicators of importance of a block to a page and $Z$ is a matrix of the indicators of importance of a page to a block.
17. A method in a computer system for determining relatedness between pages having blocks, the method comprising: calculating indicators of importance of a page to a block; calculating indicators of importance of a block to a page; and calculating page-to-page indicators of relatedness of one page to another page by combining the block-to-page indicators and the page-to-block indicators.
18. The method of claim 17 wherein the indicators of importance of a page to a block are probabilities that a user will select a link from each block that will lead to each other page.
19. The method of claim 17 wherein the indicators of importance of a block to a page are probabilities that a user will focus on each block of the page.
20. The method of claim 17 wherein the indicators of importance of a block to a page are probabilities that a user will focus on each block of the page and the indicators of importance of a page to a block are probabilities that a user will select a link from each block that will lead to each other page.
21. The method of claim 17 including calculating a rank of the pages from the page-to-page indicators.
22. The method of claim 21 wherein the calculated rank is based on a probability that a user starting at an arbitrary page will transition to another page after an arbitrarily large number of transitions between pages.
23. The method of claim 17 wherein the page-to-page indicators are calculated as follows: $W_P = XZ$, where $W_P$ is a matrix of page-to-page indicators, $X$ is a matrix of the indicators of importance of a block to a page, and $Z$ is a matrix of the indicators of importance of a page to a block.
24. A method in a computer system for identifying related images on pages having links, each link being from a block on a page containing an image to a page having another block that contains another image, the method comprising: for each image, calculating a probability for each other image that if a user is viewing the image the user will select a link from the block on a page containing that image that is to another page having a block that contains the other image; for each image, generating a vector representation of the image based on the calculated probabilities; and identifying clusters of images based on their vector representations wherein images in a cluster are related.
25. The method of claim 24 wherein the generating of a vector representation includes selecting vector representations that minimize an objective function.
26. The method of claim 25 wherein the objective function is the sum of the square of the distance between the vector representations for each pair of images times a similarity for the pair of images that is derived from the calculated probabilities.
27. The method of claim 24 wherein the calculating of the probability includes calculating probabilities that indicate a probability that a user will select a link from each block that will lead to each other page, probabilities that indicate a probability that a user will focus on each block of the page, and probabilities that indicate a probability that a user will focus on each image of each block.
28. A computer-readable medium containing instructions for controlling a computer system to determine relatedness between page elements, the method comprising: calculating indicators of importance of a first element to a second element; calculating indicators of importance of a second element to a first element; and calculating indicators of relatedness of a first element to another first element by combining the indicators of importance of a first element to a second element and the indicators of importance of a second element to a first element.
29. The computer-readable medium of claim 28 wherein the first element is a page and the second element is a block of a page.
30. The computer-readable medium of claim 28 wherein the first element is a block of a page and the second element is a page.
31. The computer-readable medium of claim 28 wherein the first element is an image of a block of a page and the second element is a block.
32. The computer-readable medium of claim 28 wherein the indicators of importance are probabilities.
33. A computer system for determining relatedness between images within blocks of pages, comprising: indicators of importance of a page to a block; indicators of importance of a block to a page; indicators of importance of an image to a block; and means for calculating image-to-image indicators of relatedness of an image to another image by combining the indicators of importance of a block to a page, the indicators of importance of a page to a block, and the image-to-block indicators.
34. The computer system of claim 33 including means for calculating indicators of importance of a page to a block as probabilities that a user will select a link from each block that will lead to each other page.
35. The computer system of claim 33 including means for calculating indicators of importance of a block to a page as probabilities that a user will focus on each block of the page.
36. The computer system of claim 33 including means for calculating the indicators of importance of an image to a block as probabilities that a user will focus on each image of each block.
37. The computer system of claim 33 including means for calculating a rank of the images from the image-to-image indicators.
38. The computer system of claim 37 wherein the calculated rank is based on a probability that a user starting at an arbitrary image will transition to another image after an arbitrarily large number of transitions between images.
39. The computer system of claim 33 including: means for generating a vector representation of each image based on the image-to-image indicators; and means for identifying clusters of images based on their vector representations wherein images in a cluster are related.
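
As an illustration only, the matrix products recited in claims 8, 16, and 23 and the long-run transition probability recited in claims 7, 15, 22, and 38 might be computed as in the sketch below. The matrix shapes (X as pages by blocks, Z as blocks by pages, Y as blocks by images) are inferred from the products rather than stated in the claims. The toy values, the helper names, and the use of plain power iteration without damping are assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    """Swap the rows and columns of a dense matrix."""
    return [list(column) for column in zip(*a)]


def stationary_rank(w: Matrix, steps: int = 200) -> List[float]:
    """Approximate the long-run visit probabilities of a random walk.

    ASSUMPTION: every row of w sums to 1 and no damping is applied.
    Starts from a uniform distribution and repeatedly applies pi <- pi * w.
    """
    n = len(w)
    pi = [1.0 / n] * n
    for _ in range(steps):
        pi = [sum(pi[i] * w[i][j] for i in range(n)) for j in range(n)]
    return pi


# Toy data (hypothetical): 2 pages, 3 blocks, 2 images.
X = [[0.75, 0.25, 0.0],   # page 0 holds blocks 0 and 1
     [0.0, 0.0, 1.0]]     # page 1 holds block 2
Z = [[0.0, 1.0],          # block 0 links only to page 1
     [0.5, 0.5],          # block 1 links to both pages
     [1.0, 0.0]]          # block 2 links only to page 0
Y = [[1.0, 0.0],          # block 0 holds image 0
     [0.0, 0.0],          # block 1 holds no image
     [0.0, 1.0]]          # block 2 holds image 1

W_B = matmul(Z, X)                            # block-to-block, claim 16
W_P = matmul(X, Z)                            # page-to-page, claim 23
W_I = matmul(matmul(transpose(Y), W_B), Y)    # image-to-image, claim 8
page_scores = stationary_rank(W_P)            # long-run page probabilities
```

In this toy data the rows of W_P sum to one, so it can be passed to `stationary_rank` directly. The rows of W_I do not sum to one here, because block 1 holds no image, so W_I would need to be normalized before the same iteration could be applied to images.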